# Introduction to Open Data Science 2019
I joined this course to discover the world of data science. Besides that, I would like to make friends with people who work in the health sciences. I am currently a PhD student at the University of Eastern Finland.
## Data wrangling

This week I prepared the data from the original file. The R script is here and the data is here.
First, install the required packages:

install.packages(c("dplyr", "GGally", "tidyverse", "xlsx", "ggplot2"))
library(dplyr)
library(ggplot2)
library(readxl)
learning2014 <- readxl::read_excel("data/learning2014.xlsx") %>%
mutate_at(vars(gender), factor)
dim(learning2014)
head(learning2014)
str(learning2014)
The data includes seven variables: gender, age, attitude, deep, stra, surf and points.

- gender: M (male), F (female)
- age: age in years, derived from the date of birth
- attitude: global attitude toward statistics
- deep: average score on deep-learning questions
- stra: average score on strategic-learning questions
- surf: average score on surface-learning questions
- points: exam points

The full data is available at https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-data.txt
View(learning2014)
# describe some variables
as.numeric(learning2014$age)
## [1] 53 55 49 53 49 38 50 37 37 42 37 34 34 34 35 33 32 44 29 30 27 29 31
## [24] 37 26 26 30 33 33 28 26 27 25 31 20 39 38 24 26 25 30 25 30 48 24 40
## [47] 25 23 25 23 27 25 23 23 23 45 22 23 21 21 21 21 26 25 26 21 23 22 22
## [70] 22 23 22 23 24 22 23 22 20 22 21 22 23 21 22 29 29 21 28 21 30 21 23
## [93] 21 21 20 22 21 21 20 22 20 20 24 20 19 21 21 22 25 21 22 25 20 24 20
## [116] 21 20 20 31 20 22 22 21 22 21 21 21 21 20 21 25 21 24 20 21 20 20 18
## [139] 21 19 21 20 21 20 21 20 18 22 22 24 19 20 20 17 19 20 20 20 20 19 19
## [162] 22 35 18 19 21
summaryvar <- summary(learning2014)
print(summaryvar)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(ggplot2)
library(tidyverse)
## -- Attaching packages ------------------- tidyverse 1.2.1 --
## v tibble 2.1.3 v purrr 0.3.2
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# see the correlation between variables
pairs.panels(learning2014)
Each panel shows the relationship between the variable on the horizontal axis and the variable on the vertical axis of the grid. Gender is a discrete variable, so we focus on the continuous variables: age, attitude, deep, stra, surf and points.

For example, the correlation between age and attitude is 0.02. Attitude toward statistics correlates positively with age, deep learning and strategic learning, but negatively with surface learning.

The diagonal shows a histogram of each variable, and the graph shows that points and attitude have the strongest correlation (r = 0.44).
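The same correlations can also be read off numerically with `cor()`. A minimal sketch, assuming `learning2014` is loaded as above:

```r
# correlation matrix of the continuous variables (learning2014 as read in above)
cont_vars <- c("age", "attitude", "deep", "stra", "surf", "points")
round(cor(learning2014[, cont_vars]), 2)
```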
library(GGally)
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014,
mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20))
)
Test the association between points and each of attitude, deep and stra:
cor.test(learning2014$attitude,learning2014$points)
##
## Pearson's product-moment correlation
##
## data: learning2014$attitude and learning2014$points
## t = 6.2135, df = 164, p-value = 4.119e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3044463 0.5521335
## sample estimates:
## cor
## 0.4365245
cor.test(learning2014$deep,learning2014$points)
##
## Pearson's product-moment correlation
##
## data: learning2014$deep and learning2014$points
## t = -0.12997, df = 164, p-value = 0.8967
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1622192 0.1423931
## sample estimates:
## cor
## -0.01014849
cor.test(learning2014$stra,learning2014$points)
##
## Pearson's product-moment correlation
##
## data: learning2014$stra and learning2014$points
## t = 1.8916, df = 164, p-value = 0.06031
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.006340149 0.291945961
## sample estimates:
## cor
## 0.1461225
The cor.test results: the correlation between attitude and points is 0.437 (p = 4.1e-09), between deep and points -0.010 (p = 0.90), and between stra and points 0.146 (p = 0.060).

Looking at the p-values, the associations of points with deep and with stra are not statistically significant, so the deep and stra variables are removed from the model.
p1 <- ggplot(learning2014, aes(x = attitude, y = points))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3 + ggtitle( "Relation between attitude and points")
print(p4)
The equation for the model is \[ Y_i = \alpha + \beta_1 X_i + \epsilon_i \] where \(Y_i\) represents points, \(X_i\) is attitude, \(\alpha\) is the intercept, \(\beta_1\) is the regression coefficient for attitude, and \(\epsilon_i\) is the random error term.
m1 <- lm(points ~ attitude, data = learning2014)
results <- summary(m1)
knitr::kable(results$coefficients, digits=3, caption="Regression coefficients")

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 11.637 | 1.830 | 6.358 | 0 |
| attitude | 3.525 | 0.567 | 6.214 | 0 |
Both the intercept and attitude are statistically significant predictors. The coefficient of determination \(R^2\) = 0.191, which is not particularly high: attitude explains about 19% of the variation in exam points.
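The coefficient of determination can be pulled straight from the model summary; a sketch, assuming `m1` is the model fitted above:

```r
# R^2: share of variation in exam points explained by attitude
summary(m1)$r.squared
```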
plot(m1, which=c(1,2,5))
This dataset concerns student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. The datasets provide information on performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks.
**Important note:** the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final-year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the source paper for details). The data (n = 382) has 35 different variables and is a join of the two datasets.
- school: binary: 'GP' (Gabriel Pereira) or 'MS' (Mousinho da Silveira)
- participant characteristics: sex, age
- demographic and family information: address, famsize, Pstatus, Medu, Fedu
- educational information: absences (number of school absences), failures (number of past class failures), nursery (attended nursery school; binary: yes or no), internet (Internet access at home; binary: yes or no), guardian (student's guardian; nominal: 'mother', 'father' or 'other')
- G1, G2, G3: grades related to the course subject, Math or Portuguese
So I want to see how age (*age*), sex (*sex*), a romantic relationship (*romantic*) and the number of school absences (*absences*) affect alcohol consumption.

The hypothesis is that being older, being male, not being in a romantic relationship and more days of school absence all increase alcohol consumption.
g1 <- ggplot(data = dat, aes(x = high_use, fill = sex))
g1 + geom_bar()
Plot 1 shows that males make up a larger proportion of the high-use group than females.
g2 <- ggplot(data = dat, aes(x = high_use, fill = romantic ))
g2 + geom_bar()
Students not in a romantic relationship account for a somewhat larger share of high alcohol consumption than those in one.
g3 <- ggplot(data = dat, aes(x = high_use, y= absences))
g3 + geom_boxplot()
More days of absence from school are associated with higher alcohol consumption.
m <- glm(high_use ~ age + sex + romantic + absences, data = dat, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + romantic + absences, family = "binomial",
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7179 -0.8456 -0.6286 1.0823 2.1893
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.54258 1.70433 -2.665 0.00769 **
## age 0.16915 0.10226 1.654 0.09810 .
## sexM 0.99159 0.24303 4.080 4.50e-05 ***
## romanticyes -0.29584 0.26585 -1.113 0.26579
## absences 0.07283 0.01828 3.983 6.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 422.16 on 377 degrees of freedom
## AIC: 432.16
##
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01064589 0.0003570945 0.290015
## age 1.18430022 0.9701859445 1.450229
## sexM 2.69551085 1.6842578252 4.375275
## romanticyes 0.74390627 0.4371001910 1.242806
## absences 1.07554516 1.0398995842 1.116863
Interpretation of the results:

- Age: being older is associated with increased alcohol use, but the effect is not statistically significant (p = 0.10).
- Being male increases the odds of high alcohol use 2.69-fold compared with being female (OR = 2.69); the result is statistically significant.
- Being in a romantic relationship decreases alcohol consumption, but the effect is not statistically significant (p = 0.27).
- Each additional day of school absence increases the odds of high alcohol use by a factor of 1.08 (95% CI 1.04-1.12).
Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.
probabilities <- predict(m, type = "response")
dat <- mutate(dat, probability = probabilities)
dat <- mutate(dat, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions
table(high_use = dat$high_use, prediction = dat$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 261 9
## TRUE 88 24
The model classifies most non-high-users correctly (261 of 270) but misses many high users (88 of 112). The training error, i.e. the total proportion of misclassified individuals, is (9 + 88) / 382 ≈ 0.254. A simple guessing strategy that always predicts FALSE would misclassify 112 / 382 ≈ 0.293 of the individuals, so the model performs somewhat better than guessing.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
cv <- cv.glm(data = dat, cost = loss_func, glmfit = m, K = nrow(dat))
summary(cv)
## Length Class Mode
## call 5 -none- call
## K 1 -none- numeric
## delta 2 -none- numeric
## seed 626 -none- numeric
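summary(cv) only lists the structure of the cv.glm object; the leave-one-out prediction error itself is stored in `cv$delta`. A sketch, assuming `cv` as computed above:

```r
# cv$delta[1] is the raw cross-validation estimate of prediction error
cv$delta[1]
```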
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyverse)
library(corrplot)
## corrplot 0.84 loaded
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#Load dataset Boston from MASS
data("Boston")
# look at the Boston data
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
#look at the structure of dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Data summary: this is a dataset about housing values in the suburbs of Boston. The Boston data frame has 506 rows and 14 columns; the variables are explained below:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
## crim zn indus chas
## Min. :-0.3900 Min. :-0.5700 Min. :-0.7100 Min. :-0.12000
## 1st Qu.:-0.2150 1st Qu.:-0.4050 1st Qu.:-0.3825 1st Qu.:-0.04750
## Median : 0.3200 Median :-0.2550 Median : 0.3950 Median : 0.02000
## Mean : 0.1786 Mean :-0.0550 Mean : 0.1929 Mean : 0.08143
## 3rd Qu.: 0.4500 3rd Qu.: 0.2775 3rd Qu.: 0.6300 3rd Qu.: 0.09000
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## nox rm age dis
## Min. :-0.770 Min. :-0.61000 Min. :-0.7500 Min. :-0.7700
## 1st Qu.:-0.360 1st Qu.:-0.29750 1st Qu.:-0.2625 1st Qu.:-0.5225
## Median : 0.305 Median :-0.21500 Median : 0.3050 Median :-0.3050
## Mean : 0.190 Mean :-0.01286 Mean : 0.1736 Mean :-0.1464
## 3rd Qu.: 0.655 3rd Qu.: 0.19000 3rd Qu.: 0.5775 3rd Qu.: 0.2400
## Max. : 1.000 Max. : 1.00000 Max. : 1.0000 Max. : 1.0000
## rad tax ptratio black
## Min. :-0.4900 Min. :-0.5300 Min. :-0.5100 Min. :-0.44000
## 1st Qu.:-0.2850 1st Qu.:-0.3050 1st Qu.:-0.2175 1st Qu.:-0.37750
## Median : 0.4600 Median : 0.4850 Median : 0.2250 Median :-0.22500
## Mean : 0.2371 Mean : 0.2364 Mean : 0.1157 Mean :-0.06071
## 3rd Qu.: 0.6075 3rd Qu.: 0.6475 3rd Qu.: 0.3775 3rd Qu.: 0.16750
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## lstat medv
## Min. :-0.7400 Min. :-0.74000
## 1st Qu.:-0.4000 1st Qu.:-0.46000
## Median : 0.4150 Median :-0.38000
## Mean : 0.1407 Mean :-0.06857
## 3rd Qu.: 0.5775 3rd Qu.: 0.31000
## Max. : 1.0000 Max. : 1.00000
Figure 1 shows the distribution of the variables and their relationships. For example, chas is a categorical variable.
# scale the variables
boston_scaled <- scale(Boston)
summary(boston_scaled %>% round(digits = 2))
## crim zn indus
## Min. :-0.420000 Min. :-0.490000 Min. :-1.560000
## 1st Qu.:-0.410000 1st Qu.:-0.490000 1st Qu.:-0.870000
## Median :-0.390000 Median :-0.490000 Median :-0.210000
## Mean :-0.000119 Mean :-0.002154 Mean :-0.001304
## 3rd Qu.: 0.010000 3rd Qu.: 0.050000 3rd Qu.: 1.010000
## Max. : 9.920000 Max. : 3.800000 Max. : 2.420000
## chas nox rm
## Min. :-0.270000 Min. :-1.46e+00 Min. :-3.880000
## 1st Qu.:-0.270000 1st Qu.:-9.10e-01 1st Qu.:-0.570000
## Median :-0.270000 Median :-1.40e-01 Median :-0.110000
## Mean : 0.001838 Mean : 5.93e-05 Mean : 0.000138
## 3rd Qu.:-0.270000 3rd Qu.: 6.00e-01 3rd Qu.: 0.480000
## Max. : 3.660000 Max. : 2.73e+00 Max. : 3.550000
## age dis rad
## Min. :-2.330000 Min. :-1.270000 Min. :-9.80e-01
## 1st Qu.:-0.837500 1st Qu.:-0.800000 1st Qu.:-6.40e-01
## Median : 0.315000 Median :-0.280000 Median :-5.20e-01
## Mean : 0.000415 Mean :-0.000138 Mean : 5.93e-05
## 3rd Qu.: 0.907500 3rd Qu.: 0.660000 3rd Qu.: 1.66e+00
## Max. : 1.120000 Max. : 3.960000 Max. : 1.66e+00
## tax ptratio black
## Min. :-1.3100000 Min. :-2.70000 Min. :-3.900000
## 1st Qu.:-0.7700000 1st Qu.:-0.49000 1st Qu.: 0.202500
## Median :-0.4600000 Median : 0.27500 Median : 0.380000
## Mean : 0.0001383 Mean : 0.00164 Mean :-0.000079
## 3rd Qu.: 1.5300000 3rd Qu.: 0.81000 3rd Qu.: 0.430000
## Max. : 1.8000000 Max. : 1.64000 Max. : 0.440000
## lstat medv
## Min. :-1.53000 Min. :-1.9100000
## 1st Qu.:-0.79750 1st Qu.:-0.5975000
## Median :-0.18000 Median :-0.1400000
## Mean : 0.00002 Mean : 0.0000988
## 3rd Qu.: 0.60000 3rd Qu.: 0.2700000
## Max. : 3.55000 Max. : 2.9900000
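scale() standardizes each column by subtracting the column mean and dividing by the column standard deviation, which is why all the means above are numerically zero:

\[ x_{scaled} = \frac{x - \bar{x}}{s_x} \]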
# see the class of the scaled object
class(boston_scaled)
## [1] "matrix"
# change boston_scaled from a matrix into a data frame
boston_scaled <- as.data.frame(boston_scaled)
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
#see quantile parts of bins
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low","med_high", "high") , include.lowest = TRUE)
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
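sample() is random, so the train/test split (and every result below) changes on each knit. Setting a seed first makes the split reproducible; a sketch (the seed value is arbitrary):

```r
set.seed(2019)                    # any fixed seed works; 2019 is arbitrary
ind <- sample(n, size = n * 0.8)  # same 80% sample on every run
```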
After scaling, all variable means are numerically zero. The crime variable is now divided into four categories according to how high the per capita crime rate by town is.
# linear discriminant analysis with target variable crime
lda_crime <- lda(crime ~ ., data = train)
lda_crime
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2673267 0.2425743 0.2475248 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 0.9495409 -0.9199965 -0.12651054 -0.8639591 0.43404501
## med_low -0.0680943 -0.3348050 -0.03128211 -0.5772463 -0.10626698
## med_high -0.3929105 0.2309388 0.08200995 0.4193641 0.05619791
## high -0.4872402 1.0149946 -0.03128211 1.0506227 -0.43184272
## age dis rad tax ptratio
## low -0.8451067 0.8750523 -0.6851839 -0.7352272 -0.40629552
## med_low -0.3744090 0.3926826 -0.5365473 -0.4405401 -0.06241517
## med_high 0.4518157 -0.4274656 -0.4110831 -0.2988490 -0.32080881
## high 0.8062673 -0.8539045 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.38275292 -0.75645050 0.481992416
## med_low 0.30585520 -0.16128368 -0.005896954
## med_high 0.06399668 0.06575168 0.134411187
## high -0.85153829 0.87940650 -0.714305390
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.117462952 0.545422837 -1.03471026
## indus 0.002425109 -0.397923258 0.14855148
## chas -0.087693725 0.032985312 0.10848084
## nox 0.275768413 -0.763331270 -1.20273382
## rm -0.116106199 -0.043194029 -0.13947095
## age 0.293425740 -0.296062208 -0.16136188
## dis -0.098528392 -0.167558107 0.28843272
## rad 3.367732753 0.833063749 -0.38742641
## tax 0.073670474 0.183675900 0.91067114
## ptratio 0.151042926 0.063378885 -0.22150150
## black -0.160182566 0.004527969 0.09739340
## lstat 0.174288616 -0.273400891 0.46644625
## medv 0.171870311 -0.470455401 -0.00897782
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9516 0.0372 0.0111
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
# plot lda with 2 discriminants
plot(lda_crime, dimen = 2, col = classes, pch = classes)
lda.arrows(lda_crime, myscale = 1)
# eigenvalues:
lda_crime$svd
## [1] 44.318707 8.767059 4.792523
The purpose of linear discriminant analysis (LDA) is to find the linear combinations of the original variables (here, the Boston variables) that give the best possible separation between the groups.

We separate crime into 4 groups: "low", "med_low", "med_high" and "high", so the number of groups is G = 4 and the number of variables is p = 13. The maximum number of useful discriminant functions is the minimum of G - 1 and p; thus the result contains 3 discriminant functions (LD1, LD2, LD3, listed under "Proportion of trace").
The higher the ratio of the between-group to the within-group standard deviation on a linear discriminant variable, the better that discriminant separates the classes. The proportion of trace for LD1 is 0.95, meaning LD1 alone captures about 95% of the between-group variance.

The biplot above shows the linear combinations of the variables that separate the crime classes.
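The proportion-of-trace values can be reproduced from the singular values stored in the lda object, since each proportion is a squared singular value over the sum of all squared singular values. A sketch, assuming `lda_crime` as fitted above:

```r
# proportion of between-group variance captured by each discriminant
round(lda_crime$svd^2 / sum(lda_crime$svd^2), 4)
```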
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda_crime, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 3 2 0
## med_low 4 19 5 0
## med_high 0 12 13 1
## high 0 0 1 28
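The overall test-set accuracy can be computed by comparing the predicted classes to the saved correct classes. A sketch, assuming the objects above:

```r
# share of test observations classified correctly
mean(correct_classes == lda.pred$class)
```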
The cross-tabulation above shows how the model predicts the crime classes. The model predicts the "high" and "low" classes quite accurately, but confuses "med_low" and "med_high" with each other.
Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
data("Boston")
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
# note: the distances below are computed on the unscaled Boston data
dist_eu <- dist(Boston)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
dist_man <- dist(Boston, method = 'manhattan')
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
# k-means clustering
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster )
# focus on columns 6 to 10
pairs(Boston[6:10], col = km$cluster)
#K-means: determine the k
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
According to the plot above, the optimal number of clusters is about 2, because that is where the total within-cluster sum of squares (WCSS) drops most dramatically.
km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:14], col = km$cluster)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda_crime$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_crime$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')